## PH/SFT Many-Core Workshop

## (Some of) the Issues facing CERN and HEP in the many-core computing era

"Capacity computing in seven dimensions"



Sverre Jarp, CERN openlab

CERN – 14 April 2008



### Contents



#### The good news

- CERN's computing infrastructure for LHC is ready
- New silicon processes are on their way

#### The lurking issues

And there are several!

#### Possible remedies

But are they painless?

#### Conclusions





# Introduction

## The Computer Centre and the Grid



In the Computing Centre, we are ready!



## LHC Computing Grid



Largest Grid service in the world!

- Over 200 sites in 40 countries
- Tens of thousands of Linux servers
- Tens of petabytes of storage







# The issues

# Evolution of CERN's computing capacity

- During the LEP era (1989-2000):
  - Doubling of compute power every year
  - Initiated with the move from mainframes to RISC systems

#### At the CHEP-95 conference in Rio:

 We made the first recommendation to move to PCs




Sverre Jarp, Hong Tang, Antony Simmins, Computing and Networks Division/CERN, 1211 Geneva 23, Switzerland

Refael Yaari, Weizmann Institute, Israel

Presented at CHEP-95, 21 September 1995, Rio de Janeiro, Brazil

## 1) Frequency scaling is over!



### The 7 "fat" years of frequency scaling in HEP

- Pentium Pro (1996): 150 MHz
- Pentium 4 (2003): 3.8 GHz (~25x)

#### Since then

- Core 2 systems:
  - ~3 GHz
  - Quad-core





## 2) The Power Wall

- The CERN Computer Centre can "only" supply 2.5 MW of electric power
  - Plus 2 MW to remove the corresponding heat!

#### Spread over a complex infrastructure:

- CPU servers; Disk servers
- Tape servers + robotic equipment
- Database servers
- Other infrastructure servers
  - AFS, LSF, Windows, Build, etc.
- Network switches and routers

### This limit will be reached in 2009!



## Definition of a hardware core/thread



#### Core

 A complete ensemble of execution logic and cache storage (etc.) as well as register files plus instruction counter (IC) for executing a software process or thread

#### Hardware thread

 Addition of a set of register files plus IC



State: Registers, IC

The sharing of the execution logic can be coarse-grained or fine-grained.

### The move to many-core systems



### Examples of process slots: Sockets \* Cores \* HW-Threads

- Conservative:
  - Dual-socket Intel quad-core (Harpertown):
    - 2 \* 4 \* 1 = 8
  - Dual-socket Intel quad-core (Nehalem):
    - 2 \* 4 \* 2 = **16**
- Aggressive:
  - Dual-socket Sun Niagara (T2) processors w/8 cores and 8 threads:
    - 2 \* 8 \* 8 = **128**
  - Quad-socket Intel Nehalem "octocore" with dual threading:
    - 4 \* 8 \* 2 = **64**
  - Single-socket "Larrabee":
    - 1 \* 24 \* 4 = 96 (?)

#### In the near future: Hundreds of process slots!
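The slot counts above are plain products of sockets, cores per socket and hardware threads per core. A minimal C++ sketch of that arithmetic follows; the struct `NodeConfig` and the function `process_slots` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>

// Hypothetical description of one server; field names are illustrative only.
struct NodeConfig {
    int sockets;
    int cores_per_socket;
    int hw_threads_per_core;
};

// Process slots = sockets * cores * hardware threads.
int process_slots(const NodeConfig& node) {
    assert(node.sockets > 0 && node.cores_per_socket > 0 &&
           node.hw_threads_per_core > 0);
    return node.sockets * node.cores_per_socket * node.hw_threads_per_core;
}
```

For instance, `process_slots(NodeConfig{2, 4, 2})` reproduces the 16 slots of the dual-socket Nehalem case, and `process_slots(NodeConfig{2, 8, 8})` the 128 slots of the Niagara case.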

## 3) Our programming paradigm

Event-level parallelism has been used for decades

Process event-by-event in a single process

#### Advantage

- Large jobs can be split into N efficient processes, each responsible for processing M events
  - Built-in scalability

#### Disadvantage

- Memory must be made available to each process
  - With 2-4 GB per process
  - A dual-socket server with quad-core processors
    - Needs 16-32 GB (or more); we currently buy only 16!
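The memory estimate above is one single-threaded process per core multiplied by the per-process footprint. A small sketch of that calculation, with hypothetical helper names (`memory_needed_gb`, `jobs_that_fit`) and assuming every process has the same footprint:

```cpp
#include <cassert>

// Memory (in GB) needed to keep every process slot busy with one process.
double memory_needed_gb(int process_slots, double gb_per_process) {
    assert(process_slots > 0 && gb_per_process > 0.0);
    return process_slots * gb_per_process;
}

// Number of such processes that fit into the installed memory.
int jobs_that_fit(double installed_gb, double gb_per_process) {
    assert(installed_gb >= 0.0 && gb_per_process > 0.0);
    return static_cast<int>(installed_gb / gb_per_process);
}
```

With 8 cores, `memory_needed_gb(8, 2.0)` and `memory_needed_gb(8, 4.0)` give the 16 GB and 32 GB bounds quoted above, while `jobs_that_fit(16.0, 4.0)` shows that a 16 GB server can host only 4 processes at the upper footprint, leaving half of the cores idle.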





### Core 2 execution ports





## 4) HEP code density



#### Averages about 1 instruction per cycle.

This "extreme" example shows even less:

High level C++ code $\rightarrow$

    if (abs(point[0] - origin[0]) > xhalfsz) return FALSE;

Assembler instructions $\rightarrow$

    movsd  16(%rsi), %xmm0
    subsd  48(%rdi), %xmm0                     // load & subtract
    andpd  _2il0floatpacket.1(%rip), %xmm0     // and with a mask
    comisd 24(%rdi), %xmm0                     // load and compare
    jbe    ..B5.3        # Prob 43%            // jump if FALSE

Same instructions laid out according to latencies on the Core 2 processor $\rightarrow$

NB: Out-of-order scheduling not taken into account.

| Cycle | Port 0 | Port 1 | Port 2            | Port 3 | Port 4 | Port 5 |
|-------|--------|--------|-------------------|--------|--------|--------|
| 1     |        |        | load point[0]     |        |        |        |
| 2     |        |        | load origin[0]    |        |        |        |
| 3     |        |        |                   |        |        |        |
| 4     |        |        |                   |        |        |        |
| 5     |        |        |                   |        |        |        |
| 6     |        | subsd  | load float-packet |        |        |        |
| 7     |        |        |                   |        |        |        |
| 8     |        |        | load xhalfsz      |        |        |        |
| 9     |        |        |                   |        |        |        |
| 10    | andpd  |        |                   |        |        |        |
| 11    |        |        |                   |        |        |        |
| 12    | comisd |        |                   |        |        |        |
| 13    |        |        |                   |        |        | jbe    |
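Reading the table above, five instructions (movsd, subsd, andpd, comisd, jbe) are spread over 13 cycles. The sketch below only redoes that arithmetic; the counts are read off the slide, not measured, and the names are hypothetical.

```cpp
#include <cassert>

// Instructions per cycle for a straight-line instruction sequence.
double instructions_per_cycle(int instructions, int cycles) {
    assert(instructions >= 0 && cycles > 0);
    return static_cast<double>(instructions) / cycles;
}

// Values read off the port table (an assumption of this sketch):
// movsd, subsd, andpd, comisd, jbe -> 5 instructions; jbe issues in cycle 13.
const int kInstructions = 5;
const int kCycles = 13;
```

`instructions_per_cycle(kInstructions, kCycles)` evaluates to 5/13, roughly 0.38, which is well below the average of about 1 instruction per cycle quoted for HEP code.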




# Possible remedies

## 1) More efficient memory footprint





## 2) Embrace parallelism

#### Our contribution:

Two openlab workshops arranged together with Intel in 2007

#### Each event:

- 1 day of lectures, 1 day of exercises
- Multiple lecturers (Intel + CERN), 45 participants, oversubscribed by 20 people
- Survey: 100% said expectations met
- Next workshop: 29/30 May 2008
- Licenses for the Intel Threading Tools (and other SW products) available to all CERN users



#### Multi-threading and Parallelism Workshop

#### 4th-5th of October 2007, CERN

A second instance of the Multi-threading and Parallelism Workshop will be held on the 4th and 5th of October 2007 at CERN. Experts from Intel will lead the two-day event and help you improve your knowledge by explaining the key principles of parallel programming and presenting the most efficient solutions to popular multi-threading problems.

#### Event highlights:

- Day 1: Fundamental aspects of multithreaded and parallel computing





# Part 1: Opportunities for scaling performance inside a core

- First three dimensions:
- Data parallelism via
  - Loop/straight-line vectorization
  - Superscalar: Fill the ports
  - Pipelined: Fill the stages
  - SIMD: Fill the register width







## 3) Bet on Simultaneous Multithreading (SMT)



#### Provided the memory issue is solved

We could easily tolerate 4x SMT!

[Figure: four copies of the Core 2 port table (cycles 1-13, ports 0-5) for the `abs(point[0] - origin[0]) > xhalfsz` example, overlaid with a one-cycle offset between copies, one copy per hardware thread.]
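A rough way to check the 4x SMT claim is a resource bound: N threads running the same 13-cycle sequence cannot finish faster than the dependency chain of one thread, nor faster than the busiest port allows. The sketch below assumes the per-port counts read off the earlier port table (port 0: andpd and comisd; port 1: subsd; port 2: four loads; port 5: jbe) and ignores caches, decode limits and real scheduling, so it is an optimistic estimate rather than a simulation. All names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Operations issued per port (0..5) by ONE thread, read off the port table.
const std::array<int, 6> kPortUses = {{2, 1, 4, 0, 0, 1}};
const int kInstructions = 5;   // movsd, subsd, andpd, comisd, jbe
const int kChainCycles = 13;   // length of one thread's schedule

// Optimistic lower bound on the cycles needed by n_threads interleaved copies:
// limited either by one thread's latency chain or by the busiest port.
int min_cycles(int n_threads) {
    assert(n_threads >= 1);
    const int busiest = *std::max_element(kPortUses.begin(), kPortUses.end());
    return std::max(kChainCycles, n_threads * busiest);
}

// Best-case aggregate instructions per cycle under that bound.
double best_case_ipc(int n_threads) {
    return static_cast<double>(n_threads * kInstructions) / min_cycles(n_threads);
}
```

Under these assumptions four threads need at least 16 cycles (the single load port becomes the limit), giving a best case of 20 instructions in 16 cycles, i.e. 1.25 instructions per cycle instead of about 0.38 for one thread.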

## Part 2: Parallel execution across multithreaded cores





- First three dimensions
- We move to the next level:
  - Three additional dimensions inside a node:
    - HW threads
    - Processor cores
    - Sockets



#### Ideal for thread-level parallelism

Seventh dimension represented by multiple nodes.

## Rethink concurrency in HEP



- We are "blessed" with lots of it:
  - Events
  - Particles, tracks and vertices
  - Physics processes
  - I/O streams (Trees, branches)
  - Buffer manipulations (also compaction, etc.)
  - Fitting variables
  - Partial sums, partial histograms
  - (Your favorite comes here)
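As one illustration of the "partial sums, partial histograms" item, the sketch below lets each thread fill a private histogram over its share of the input and merges the partial results at the end, so no locking is needed while filling. It uses C++11 `std::thread`, which postdates this talk and stands in for pthreads or TBB; the function names and the binning convention are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Histogram = std::vector<long>;

// Fill a private histogram from values[begin, end); entries outside [lo, hi) are skipped.
void fill_partial(const std::vector<double>& values, std::size_t begin, std::size_t end,
                  double lo, double hi, Histogram& hist) {
    const std::size_t nbins = hist.size();
    for (std::size_t i = begin; i < end; ++i) {
        const double x = values[i];
        if (x < lo || x >= hi) continue;
        std::size_t bin = static_cast<std::size_t>((x - lo) / (hi - lo) * nbins);
        if (bin >= nbins) bin = nbins - 1;  // guard against floating-point rounding
        ++hist[bin];
    }
}

// Split the input across nthreads workers, then merge the partial histograms.
Histogram fill_parallel(const std::vector<double>& values, std::size_t nbins,
                        double lo, double hi, unsigned nthreads) {
    assert(nthreads > 0 && nbins > 0 && hi > lo);
    std::vector<Histogram> partial(nthreads, Histogram(nbins, 0));
    std::vector<std::thread> workers;
    const std::size_t chunk = (values.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t begin = std::min(values.size(), t * chunk);
        const std::size_t end = std::min(values.size(), begin + chunk);
        workers.emplace_back(fill_partial, std::cref(values), begin, end,
                             lo, hi, std::ref(partial[t]));
    }
    for (auto& w : workers) w.join();

    Histogram total(nbins, 0);
    for (const auto& h : partial)
        for (std::size_t b = 0; b < nbins; ++b) total[b] += h[b];
    return total;
}
```

The same pattern (private accumulators plus a final merge) applies to partial sums; the price is one histogram copy per thread, which matters given the memory concerns raised earlier.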

## C++ multithreading support



- Beyond auto-vectorization/auto-parallelization, there is a large selection of low-level tools:
  - OpenMP
  - MPI
  - pthreads/Windows threads
  - Threading Building Blocks (TBB)
  - TOP-C (from Northeastern University)
  - RapidMind
  - Ct (in preparation)
  - etc.

Complementary tools available at CERN:

- Intel Thread Checker, Thread Profiler
- Linux perfmon2 (Stéphane Eranian)

## Examples of parallelism: CBM track fitting

I. Kisel/GSI: "Fast SIMDized Kalman filter based track fit" http://www-linux.gsi.de/~ikisel/reco/CBM/DOC-2007-Mar-127-1.pdf

- Extracted from CBM's High Level Trigger Code
  - Originally ported to IBM's Cell processor
- Tracing particles in a magnetic field
  - Embarrassingly parallel code
- Re-optimization on Intel Core systems
  - Step 1: use SSE vectors instead of scalars
    - Operator overloading allows seamless change of data types, even between primitives (e.g. float) and classes
    - Two classes
      - P4\_F32vec4 packed single; operator + = \_mm\_add\_ps
      - P4\_F64vec2 packed double; operator + = \_mm\_add\_pd
  - Step 2: add multithreading (via TBB)
    - Enable scaling with core count
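Step 1 can be illustrated with a small wrapper class whose overloaded operators map to SSE intrinsics, so that templated code compiles unchanged for `float` and for a 4-wide vector. This is a sketch and not the CBM code: `Vec4fSketch` is a hypothetical stand-in for the packed-single class named above, and `extrapolate` is an invented example function. It requires an x86 compiler with SSE support.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: _mm_add_ps, _mm_mul_ps, ...

// Hypothetical 4 x float wrapper; operators forward to SSE intrinsics.
struct Vec4fSketch {
    __m128 v;
    explicit Vec4fSketch(float a) : v(_mm_set1_ps(a)) {}  // broadcast one value
    Vec4fSketch(float a, float b, float c, float d) : v(_mm_setr_ps(a, b, c, d)) {}
    explicit Vec4fSketch(__m128 x) : v(x) {}

    // Read back one lane (0..3); intended for checks, not for hot loops.
    float lane(int i) const {
        assert(i >= 0 && i < 4);
        float tmp[4];
        _mm_storeu_ps(tmp, v);
        return tmp[i];
    }
};

inline Vec4fSketch operator+(Vec4fSketch a, Vec4fSketch b) {
    return Vec4fSketch(_mm_add_ps(a.v, b.v));
}
inline Vec4fSketch operator*(Vec4fSketch a, Vec4fSketch b) {
    return Vec4fSketch(_mm_mul_ps(a.v, b.v));
}

// Generic code written once: works for T = float and T = Vec4fSketch.
template <typename T>
T extrapolate(T x, T slope, T dz) {
    return x + slope * dz;
}
```

Calling `extrapolate` with `float` arguments handles one track at a time, while the `Vec4fSketch` instantiation handles four tracks per operation with the same source code.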





## CBM HLT benchmark runs



#### Real fit time/track (µs) as a function of the core count:



## Examples of parallelism: RooFit (1)



#### Example of Data Analysis (Fitting) in BaBar (SLAC)

- Uses MPI to run scatter/gather
  - Based on the Negative-Log Likelihood function which requires the calculation of separate values for each free parameter in each minimization step



From B.Meadows's talk at RooFit Mini Workshop @ SLAC (December 2007): http://www.slac.stanford.edu/BFROOT/www/doc/Workshops/2007/BaBar\_RooFit/Agenda.html

## RooFit (2)



#### It works well with a large number of parameters

$$\text{Gain} \sim \frac{N_{\text{CPU}}\,(N_{\text{PAR}} + 2)}{N_{\text{PAR}} + 2\,N_{\text{CPU}}}, \qquad \text{Max. Gain} = N_{\text{CPU}}$$
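To see how quickly the gain approaches its maximum, the formula can be evaluated directly. The function name `scatter_gather_gain` below is hypothetical; the expression is the one quoted above, with NCPU the number of processes and NPAR the number of free parameters.

```cpp
#include <cassert>

// Gain ~ NCPU * (NPAR + 2) / (NPAR + 2 * NCPU), as quoted in the text.
double scatter_gather_gain(int ncpu, int npar) {
    assert(ncpu >= 1 && npar >= 0);
    return static_cast<double>(ncpu) * (npar + 2) / (npar + 2.0 * ncpu);
}
```

With one CPU the gain is exactly 1. With 4 CPUs, 10 parameters give 48/18, about 2.7, while 1000 parameters give 4008/1008, about 3.98, close to the maximum of NCPU = 4.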



## Programming strategies/priorities

#### As I see them:

- Get memory usage (per process) under control
  - To allow higher multiprogramming level per server
- Draw maximum benefit from hardware threading
- Introduce coarse-grained software multithreading
  - To allow further scaling with large core counts
- Revisit data parallel constructs at the very base
  - Gain performance inside each core
- Use appropriate tools (perfmon2/Thread Profiler, etc.)
  - To monitor detailed program behaviour

### Conclusions



- In spite of the fact that we have thousands of servers locally (and tens of thousands in the Grid), we are confronted by several computing issues!
- Some solutions are easier than others:
  - Switch on SMT in the BIOS
  - Reduce memory footprint
  - Gradually introduce parallelism (across threads/cores/sockets)
  - Build an additional (5 MW?) computer centre
  - Revisit vectors (data parallelism)
- But, above all, maintain software scalability and portability
- In all cases, we must master the 7 hardware dimensions!

